Regression Analysis
Logistic Regression
Learning objectives:
What is logistic regression?
Logistic regression is a type of supervised learning algorithm used for classification tasks. It is a linear model that estimates the probability that an instance belongs to a particular class.
In logistic regression, the goal is to predict a binary outcome (e.g., a yes/no, 0/1, true/false) based on one or more input features. We may use logistic regression to predict if a person will likely default on a loan based on their credit score, income, and other factors.
The basic idea behind logistic regression is to find the line (or hyperplane in higher dimensions) that best separates the positive and negative classes.
We can find the line by training a model to predict the probability that an instance belongs to the positive class given its features. The model can then make predictions by thresholding this probability at a specific value (e.g., 0.5).
Mathematically, the logistic regression model estimates the probability of an instance belonging to the positive class using the sigmoid function:
\[P(y=1 \mid x) = \frac{1}{1 + \exp(-w^\top x)}\]
where \(x\) is the input features and \(w\) is the model weights. The sigmoid function maps any real-valued number to the range [0, 1], which makes it a good choice for modeling probabilities.
To train the model, we need to adjust the weights so that the model makes good predictions on the training data. We can adjust the weights using an optimization algorithm, such as gradient descent, that minimizes a loss function that measures how well the model performs.
One standard loss function used in logistic regression is the cross-entropy loss, which is defined as:
\[L(y, \hat{p}) = -\bigl(y \log \hat{p} + (1 - y) \log(1 - \hat{p})\bigr)\]
where \(y\) is the true label (either 0 or 1) and \(\hat{p} = P(y=1 \mid x)\) is the predicted probability that \(y=1\) given the input features \(x\).
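A short worked step (standard algebra, added here for reference): combining this loss with the sigmoid model gives a simple gradient for a single training example, which is exactly what gradient descent uses to update the weights:
\[\frac{\partial L}{\partial w} = (\hat{p} - y)\,x\]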
Once the model is trained, it can make predictions on new instances by plugging the input features into the model and thresholding the probability at 0.5. For example, if the model predicts a probability of 0.8 that an instance belongs to the positive class, we can predict that the instance belongs to the positive class.
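To make these steps concrete, here is a minimal NumPy sketch of the workflow described above; the toy data, learning rate, and function names are illustrative choices, not a reference implementation:

```python
import numpy as np

def sigmoid(z):
    # Map any real-valued score into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iter=1000):
    """Fit weights and bias by gradient descent on the cross-entropy loss."""
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)                 # predicted P(y=1 | x) for every sample
        w -= lr * (X.T @ (p - y)) / n_samples  # gradient of the average cross-entropy loss
        b -= lr * np.mean(p - y)
    return w, b

def predict(X, w, b, threshold=0.5):
    # Threshold the predicted probability to obtain a 0/1 class label
    return (sigmoid(X @ w + b) >= threshold).astype(int)

# Toy example with a single feature
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = train_logistic_regression(X, y)
print(predict(X, w, b))  # expected: [0 0 0 1 1 1]
```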
Binary logistic regression
Binary logistic regression is a statistical method for predicting a binary outcome. In other words, it is a method for predicting an outcome with only two possible values (e.g., success/failure, sick/healthy, and so on) based on one or more predictors.
The logistic regression model estimates the probability that an event will occur (e.g., the probability that an individual will develop a specific disease) given the predictor variables' values.
The model then converts the predicted probability into a binary outcome (e.g., 0 or 1, "not sick" or "sick") by using a threshold value of 0.5: if the predicted probability is greater than or equal to 0.5, the outcome is 1 (e.g., "sick"); if the predicted probability is less than 0.5, the outcome is 0 (e.g., "not sick").
To fit a binary logistic regression model, we need to find the values of the model parameters (i.e., the coefficients of the predictor variables) that maximize the likelihood of the data. In other words, we want to find the values of the model parameters that make it most likely that the observed data occurred.
In practice, we do the fitting with an optimization algorithm that minimizes a loss function, the negative log-likelihood (equivalent to the cross-entropy loss introduced above), which measures how well the model fits the data.
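This fitting is rarely coded by hand; a library such as scikit-learn can do it in a few lines. The sketch below is one possible illustration, assuming scikit-learn is installed and using its built-in breast cancer dataset as the binary outcome:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary outcome: malignant (0) vs. benign (1)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardizing the predictors helps the optimizer converge
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)  # finds the coefficients that maximize the likelihood

print(model.predict_proba(X_test[:5]))  # predicted probability of each outcome
print(model.predict(X_test[:5]))        # 0/1 predictions using the 0.5 threshold
```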
Several assumptions must be satisfied for the results of a binary logistic regression to be valid. These include:
- the assumption of independence of observations
- the assumption of linearity of predictors in the log odds
- the assumption that the errors are distributed according to a logistic distribution
If these assumptions are not satisfied, the model's results may be biased.
Binary logistic regression can predict a binary outcome based on a single predictor variable (simple logistic regression) or multiple predictor variables (multiple logistic regression). It is a widely used method in many fields apart from bioinformatics, including medicine, finance, and psychology.
Multiclass logistic regression
Multiclass logistic regression is a supervised learning algorithm for classification tasks with more than two classes. It is an extension of the logistic regression algorithm used for binary classification.
In logistic regression, we try to predict the probability of an event occurring, given some input features. The output is a probability between 0 and 1, which can be interpreted as the likelihood of the event occurring.
In multiclass logistic regression, we extend this idea to classify samples into more than two classes. For example, we might have a dataset of images of animals, and we want to classify each image as a dog, cat, or bird. In this case, we would have three classes: "dog," "cat," and "bird."
One way to implement multiclass logistic regression is the "one-vs-rest" (also called "one-vs-all") approach: train one binary logistic regression model for each class and then use these models to make predictions for a new sample.
For example, we could train one model to predict the probability that an image is a dog, another model to predict the probability that an image is a cat, and another model to predict the probability that an image is a bird.
To predict a new sample, we would use all three models to predict the probabilities for each class and then select the class with the highest probability as the prediction.
Another way to implement multiclass logistic regression is to fit a single multinomial (softmax) model. Instead of training a separate binary model for each class, one model learns a set of weights for every class at the same time and normalizes the class scores so that the predicted probabilities across all classes sum to one.
In the animals example, a single model would take an image's features and output three probabilities, one each for "dog," "cat," and "bird."
To predict a new sample, we would compute these probabilities and select the class with the highest probability as the prediction.
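For reference (a standard formulation, not spelled out in the original text), the multinomial model gives each class \(k\) its own weight vector \(w_k\) and normalizes the class scores with the softmax function:
\[P(y=k \mid x) = \frac{\exp(w_k^\top x)}{\sum_j \exp(w_j^\top x)}\]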
Both of these approaches can be effective for multiclass logistic regression, and the choice of which approach to use will depend on the specific task and the available data.
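The sketch below shows one way to try both strategies with scikit-learn (the iris dataset stands in for the dog/cat/bird example; the settings are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # three classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Approach 1: one-vs-rest -- one binary logistic regression per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

# Approach 2: a single multinomial (softmax) logistic regression over all classes
# (recent scikit-learn versions use the multinomial formulation by default here)
softmax = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("one-vs-rest accuracy:", ovr.score(X_test, y_test))
print("multinomial accuracy:", softmax.score(X_test, y_test))
```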
Regularization in logistic regression
Logistic regression works by finding the weights that minimize the error between the predicted probabilities and the true labels of the training examples. Without regularization, however, the model can overfit the training data and perform poorly on unseen examples.
Regularization is a method used to address overfitting in logistic regression. It prevents overfitting by adding a penalty term to the objective function that is being minimized.
The goal is to constrain the size of the coefficients or weights so that they do not become too large. Regularization helps to prevent the model from becoming overly complex and sensitive to the noise in the training data.
There are two main types of regularization in logistic regression: L1 and L2.
L1 regularization is also known as "Lasso regularization." It discourages overfitting by adding a penalty term to the objective function equal to the sum of the absolute values of the weights. The objective function becomes:
\[\text{Loss} + λ \sum_i |w_i|\]
where \(λ\) is the regularization parameter, which controls the strength of the regularization. A larger value of \(λ\) means that the regularization term has a greater influence on the objective function, and the weights are more heavily constrained.
L2 regularization is also known as "Ridge regularization." It adds a penalty term to the objective function equal to the sum of the squared weights, and the objective function becomes:
\[\text{Loss} + λ \sum_i w_i^2\]
Like L1 regularization, L2 regularization uses a regularization parameter, \(λ\), to control the strength of the regularization. However, the penalty term in L2 regularization is the sum of the squared weights rather than the sum of their absolute values.
Both L1 and L2 regularization can be helpful in logistic regression, depending on the specific problem and the desired properties of the model. L1 regularization is generally preferred when we want to select a small number of important features, as it has the property of setting some weights to exactly zero.
L2 regularization, on the other hand, is generally preferred when we want to avoid overfitting while preserving all the features, as it has the property of making all the weights smaller but non-zero.
Regularization is vital when training a logistic regression model, as it can help improve the model's generalization performance and prevent overfitting the training data.
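As an illustration, scikit-learn exposes both penalties through the `penalty` argument; note that its `C` parameter is the inverse of the regularization strength \(λ\). The dataset and the value of `C` below are arbitrary choices for this sketch:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # standardize so the penalty treats all weights comparably

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
ridge = LogisticRegression(penalty="l2", C=0.1, max_iter=1000).fit(X, y)

# L1 tends to set some weights exactly to zero; L2 shrinks them but keeps them non-zero
print("L1 weights set to zero:", np.sum(lasso.coef_ == 0))
print("L2 weights set to zero:", np.sum(ridge.coef_ == 0))
```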
Evaluating the performance of logistic regression models
Assessing the performance of a logistic regression model is a vital step in the process of model building.
The performance of a logistic regression model can be evaluated in several ways, and the choice of evaluation metric depends on the specific circumstances and goals of the analysis. Here are some standard evaluation metrics for logistic regression models:
- Classification Accuracy:
- This is the most common evaluation metric for classification models. It is the ratio of the number of correct predictions made by the model to the total number of predictions made. It is a simple and intuitive metric but can be misleading in imbalanced classification problems.
- Confusion Matrix:
- A confusion matrix is a table that summarizes the model's predictions by the counts of true positives, true negatives, false positives, and false negatives. It is a valuable tool for understanding the model's errors and identifying patterns in the data.
- Precision and Recall:
- Precision is the proportion of the model's positive predictions that are true positives. Recall is the proportion of actual positive cases in the data that the model correctly identifies. Precision and recall are helpful metrics for imbalanced classification problems, as they provide a more comprehensive picture of the model's performance.
- F1 Score:
- The F1 score is the harmonic mean of precision and recall. It is a balance between precision and recall, and it is a useful metric for imbalanced classification problems.
- AUC-ROC Curve:
- The ROC curve is a graphical representation of the model's performance: a plot of the true positive rate versus the false positive rate at different classification thresholds. The AUC (Area Under the Curve) summarizes the model's overall performance, with a value of 0.5 indicating a model no better than random guessing and a value of 1 indicating a perfect model.
It is crucial to evaluate the performance of a logistic regression model using a variety of evaluation metrics, as no single metric can capture the complete picture of the model's performance.
It is also essential to consider the specific goals of the analysis and choose evaluation metrics that are appropriate for those goals.
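The sketch below computes each of these metrics with scikit-learn, continuing the breast cancer example used earlier (the dataset and split are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

y_pred = model.predict(X_test)              # hard 0/1 predictions (0.5 threshold)
y_prob = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, y_prob))
```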
References
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1), 267-288. JSTOR 2346178.